This dataset consists of all the tracks currently saved in my Spotify library. In other words, every track (or group of them) that I’ve liked, telling Spotify that I’m into that song. Each row is a single track, i.e. a registered liking event. The goal of this analysis is to delve into and explore how my music-liking habits on this specific platform have changed. Even though I, being the one feeding the data, have an overall idea of how things changed, I have a feeling that some surprises might arise. Or not. But I’m really into music; to say that I listen daily is far from an overstatement. As far as I remembered to like the tracks I indeed liked, Spotify has been my main gateway to music listening, and so a reliable data source.
Something that can be helpful if done before our analysis is spotting and correcting for missing values, so we’ll start by omitting them.
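The original code chunks aren’t shown in this export, but the check-and-omit step could look like the minimal sketch below, using a toy data frame (`df` is a hypothetical name; in the real analysis it would come from reading the exported library file):

```r
# Toy data frame standing in for the real library export
df <- data.frame(name = c("a", "b", "c"), energy = c(0.7, NA, 0.9))

any(is.na(df))   # TRUE: there is one missing value

df <- na.omit(df)   # drop every row containing at least one NA

any(is.na(df))   # FALSE: the data is now clean
```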
## [1] FALSE
Are there any missing values? FALSE
Let’s take a look at all the variables and decide which ones are worth analyzing:
## [1] "X" "added_at" "artist_id"
## [4] "artist_name" "duration_ms" "explicit"
## [7] "id" "name" "popularity"
## [10] "acousticness" "danceability" "duration_ms.1"
## [13] "energy" "id.1" "instrumentalness"
## [16] "key" "liveness" "loudness"
## [19] "mode" "speechiness" "tempo"
## [22] "time_signature" "valence"
It’d be interesting to analyze:
key, mode, danceability, liveness, speechiness, valence, loudness, tempo, instrumentalness and popularity.

The first question is a quick one: how does my artist selection rank?
That’s interesting. I couldn’t remember adding “Daniel Grau” songs, but someone else analyzing this data would assume I’m a huge fan (which I’m really not). J Cole, on the other hand, is an artist I really enjoy.
We can definitely observe that the majority of artists have only one saved track. But it made me wonder how high this ratio is.
It is indeed really large: 99% of the artists in my library have at most 2 tracks saved.
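That ratio falls out directly from the per-artist counts. A sketch, again with a toy data frame in place of the real one:

```r
# Toy library: artist A saved twice, B once, C three times
df <- data.frame(artist_name = c("A", "A", "B", "C", "C", "C"))

# Number of saved tracks per artist
per_artist <- table(df$artist_name)

# Share of artists with at most 2 saved tracks (here 2 of 3 artists)
mean(per_artist <= 2)
```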
It’s important to know the span of the data: which was the first saved song, and which the latest?
First we need to convert the dates from Factor to a proper date format.
## [1] 3.232877
That’s approximately 3 years and 2 months worth of music saving.
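The conversion and the span computation might look like the sketch below. The endpoint dates are hypothetical stand-ins for the real first and latest `added_at` values, and dividing by 365 is only an approximation (it ignores leap days):

```r
# added_at is read in as a Factor; convert via character to Date
# (the two dates here are made up for illustration)
added <- as.Date(as.character(factor(c("2015-09-01", "2018-11-25"))))

# Span in years between the first and the latest saved track
as.numeric(max(added) - min(added)) / 365
```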
This part really interests me and, since there is a fair number of features, I’m condensing the analysis. I’m focusing on features that might hold interesting information: danceability, energy, instrumentalness, speechiness, popularity and acousticness. These can tell a lot about a given track, let alone a library of them.
It’s important to point out that most of the features with a [0, 1] range hold information about magnitude. For example, if a song has an instrumentalness of 0.95, it is about as purely instrumental as it can be; if a song scores 0.5 on energy, it’s an averagely energetic song. Most of these features are extracted through extensive waveform analysis.
## danceability energy speechiness instrumentalness
## Min. :0.1030 Min. :0.0332 Min. :0.02500 Min. :0.0000
## 1st Qu.:0.6780 1st Qu.:0.5200 1st Qu.:0.04400 1st Qu.:0.2465
## Median :0.7570 Median :0.6620 Median :0.05450 Median :0.8300
## Mean :0.7263 Mean :0.6441 Mean :0.08588 Mean :0.6290
## 3rd Qu.:0.8020 3rd Qu.:0.7850 3rd Qu.:0.07845 3rd Qu.:0.8950
## Max. :0.9830 Max. :0.9910 Max. :0.68100 Max. :0.9720
## popularity acousticness
## Min. : 0.00 Min. :0.0000055
## 1st Qu.: 6.00 1st Qu.:0.0017400
## Median :19.00 Median :0.0159000
## Mean :21.13 Mean :0.1360420
## 3rd Qu.:33.00 3rd Qu.:0.1455000
## Max. :78.00 Max. :0.9700000
This distribution is of great interest, and it’s no surprise that the plot is left-skewed: most of the songs lie in the [0.7, 0.8] range, which highlights how many danceable tracks I’ve saved.
Perhaps energy shares some latent information with danceability? Even though this distribution is not as left-skewed, most of the songs sit above the 0.5 mark.
It’s clear that the songs I’ve saved don’t have many vocals: the distribution is highly right-skewed, and it reveals another clear pattern in the dataset.
It’s also no news that most songs have a high instrumentalness ratio; I know my taste for and appreciation of instrumental songs. What is surprising, however, is the relatively high count of 0 values.
Perhaps acousticness shares latent information with speechiness: another clearly right-skewed plot, with most values in the [0, 0.2] range.
These plots are enlightening. They clearly show a pattern: I tend to like tracks that are danceable, energetic and mostly instrumental, with very few vocals. This was somewhat expected, given my taste for electronic music. But I also tend to listen to rap and hip hop, which have vocals, and that isn’t reflected in my saved tracks. Perhaps I don’t like them that much?
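The suspected overlaps above (energy with danceability, acousticness with speechiness) can be checked in one shot with a correlation matrix. A sketch with made-up feature values in place of the real columns:

```r
# Toy feature columns; the real ones come from the library data frame
feats <- data.frame(
  danceability = c(0.71, 0.80, 0.65, 0.90),
  energy       = c(0.60, 0.75, 0.50, 0.85),
  speechiness  = c(0.05, 0.04, 0.30, 0.06),
  acousticness = c(0.02, 0.01, 0.40, 0.03)
)

# Pairwise Pearson correlations, rounded for readability
round(cor(feats), 2)
```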
Looks like we have, besides the average, many tracks in D minor and G, and also very few tracks in E minor.
The chosen dataset is relatively small, with 923 records, but also relatively dense, with 23 features. Each row is an event of saving a track to my Spotify library. Most of the features range from 0 to 1, representing the magnitude of a given quality, e.g. instrumentalness indicates, from 0 to 1, how instrumental a track is.
The features I’m most interested in have had their distributions plotted above: danceability, energy, instrumentalness, speechiness, popularity and acousticness. Even though there are many features, these few might hold some information on how my taste has evolved, which is why I selected them.
For this first brief analysis I only had to remove a single missing value. Besides that, the variables were analyzed individually, so no feature engineering was done.
Now we’re ready to explore how the features relate to each other. For now we’ll limit ourselves to pairwise, bivariate, analysis. The first thing that comes to mind is to explore the time dimension of this dataset.
Now we can analyze grouping by date:
There is a clear increase from mid-2017 on. If we grouped this data monthly, would we see a trend?
I was almost right. July 2017 was the month I added the most songs: 172, or 18% of the entire dataset. You can see that the second highest is July 2018. Maybe there’s a pattern there?
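Grouping by month boils down to truncating each date to its year-month before counting. A sketch with made-up dates standing in for the real `added_at` column:

```r
# Hypothetical save dates; the real ones come from added_at
added <- as.Date(c("2017-07-01", "2017-07-15", "2017-07-30", "2018-07-02"))

# Truncate each date to year-month, then count saves per month
per_month <- table(format(added, "%Y-%m"))
per_month   # e.g. 3 saves in 2017-07, 1 in 2018-07
```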
The graph confirms our assumption: for some reason, songs are added more in the second half of the year than in the first.
How much does danceability correlate with popularity?
Well, it seems not so much. Analyzing the scatter and the density of the points, we can tell that most danceable songs do not rank high in popularity.
It seems that danceable tracks tend to have high energy as well.
There seems to be a correlation between a track’s energy and how loud it is. A little too expected, perhaps. But how correlated are they?
##
## Pearson's product-moment correlation
##
## data: energy and loudness
## t = 22.152, df = 921, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5458111 0.6301299
## sample estimates:
## cor
## 0.5895744
And indeed there is a substantial correlation: a Pearson coefficient of 0.59. From this we can tell that the two features carry a good deal of the same latent information.
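The report above is what base R’s `cor.test` prints. The call that produces that kind of output looks like the sketch below, with toy vectors standing in for the real columns:

```r
# Toy vectors in place of the real energy and loudness columns
energy   <- c(0.30, 0.55, 0.62, 0.78, 0.91)
loudness <- c(-14.0, -9.5, -8.0, -6.2, -4.1)

# Pearson's product-moment correlation, with a t-test and 95% CI
ct <- cor.test(energy, loudness, method = "pearson")
ct$estimate   # sample correlation coefficient
ct$p.value    # significance of the test
```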
Danceability has interesting relationships: roughly speaking, tracks that are less popular are more danceable, and tracks that have more energy are more danceable. It’s important to notice that this is a specific subset of a larger population, and it might simply be the case that I don’t listen to many popular tracks, so this relationship holds only with reference to this set.
It seems there might be clues on how instrumental a song is just by looking at its key. This hints at what kind of instrumental tracks I roughly prefer: low-pitched (or bass-rich) tracks.
I suspect that a few features have latent relationships with danceability. And since understanding what lies behind this feature is of much interest, let’s plot it:
I noticed that a lot of features share the same [0.0, 1.0] range. So what would an average track sampled from my library look like?
And it’s confirmed: I like music that is danceable, energetic and heavily instrumental. These tracks also tend to be low on acousticness and speechiness, and not popular.
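That “average track” is just the column-wise mean of the [0, 1] features. A sketch with two toy rows in place of the real 923:

```r
# Toy [0, 1] features; the real data frame has 923 rows
feats <- data.frame(
  danceability     = c(0.70, 0.80),
  energy           = c(0.60, 0.70),
  instrumentalness = c(0.90, 0.75),
  speechiness      = c(0.05, 0.08),
  acousticness     = c(0.02, 0.20)
)

# Profile of the "average" saved track
round(colMeans(feats), 2)
```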
If we were to plot, aggregating monthly, the average of each feature, how would these features evolve (if they evolve at all) through time?
Well, the aforementioned evolution doesn’t exist: those time series are rather noisy. But danceability and energy stand above a baseline, meaning that throughout this period my taste for energetic and danceable tunes remained much the same. It’s also interesting to note how my listening to acoustic songs varies over time.
The scatterplot above was more confusing than illuminating. This heatmap, on the other hand, is great at displaying patterns: it makes crystal clear how each feature, on average, changed month to month. The assumptions about danceability, energy and acousticness are clearly verifiable.
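The monthly averages feeding such a heatmap can be built with base R’s `aggregate`, sketched here with toy rows (real rows would carry the year-month of `added_at` plus the audio features):

```r
# Toy rows: each save has a year-month and a couple of features
saves <- data.frame(
  month   = c("2017-07", "2017-07", "2017-08"),
  energy  = c(0.60, 0.80, 0.50),
  valence = c(0.40, 0.60, 0.30)
)

# Average of each feature per month: one row per month, one column per feature
aggregate(cbind(energy, valence) ~ month, data = saves, FUN = mean)
```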
The danceability feature describes how suitable a track is for dancing, based on a combination of musical elements including tempo, rhythm stability, beat strength and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. Popularity is based on the number of streams a given song has. It ranges from 0 to 100, 0 being a song with little to no streams and 100 a song featured on the Top 200 charts.
The relationship between how danceable a track is and how popular it is often strikes me. Adjusting the points’ opacity and adding density plots gives a clear view of both distributions and of the relationship between them. It’s also worth noting that I’m not a very mainstream listener myself; nevertheless, it’s clear that popular songs don’t tend to be very danceable.
Loudness represents the overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values range between -60 and 0 dB. Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
A previous assumption of mine was strongly confirmed: how energetic a song feels correlates roughly linearly with how loud it is. This discovery is important to me because I’m an aspiring DJ, and unraveling these kinds of relationships is of great help in understanding how to classify songs.
This timeline tile-map shows that my taste over the years didn’t change much, which was a surprise to me. I had the belief I was an eclectic, wide-ranging listener. But the data doesn’t lie: even though instrumentalness and valence clearly varied throughout the months, the overall profile stayed stable.
This analysis was of great benefit to me! I was able to get a more realistic view of what kind of music I’ve been enjoying for the past three years or so, and of how that taste changed. Or not. As I mentioned earlier, this has extra importance for me since I’m now focusing on a more serious DJ enterprise, and having this kind of raw analysis of songs (provided by Spotify) enhances how I understand music.
I wish I were more proficient in the R language, in order to write and create more professionally. But I value these kinds of tasks that take me out of my comfort zone to learn new things, which was the idea of this whole course. I had a few issues with data transformation and with the language’s overall syntax and writing style, but I think I managed to get it done.
I’m also aware that many more analyses could have been done. There was the genre feature that I missed; I could have scraped it as well, but time was a constraint here.